Beyond RAG — going from search to analytics on unstructured data with Aryn
Mehul Shah • Location: Theater 7 • Haystack 2025
“Over the past year, generative AI models of GPT-4 quality have become 50-80x cheaper and 10-20x faster. Given this trend, LLMs have the potential to go beyond RAG and run complex semantic analyses on unstructured data at scale. Consider, for example, marketing analysts in pharmaceuticals who want to analyze thousands of interviews to assess the key factors in the adoption of medications by market segment, or financial analysts who want to perform due diligence across thousands of reports to form an investment thesis. Serving these use cases requires systems that efficiently sweep through large datasets, harvest high-quality metadata at query time, and synthesize results.
To this end, we motivate and describe the design of an AI-powered unstructured data warehouse, eponymously named Aryn. With Aryn, users specify queries in natural language, and the system automatically generates and executes a semantic plan across a large collection of unstructured documents. To accomplish this, it orchestrates data through a collection of visual AI models and LLMs using the open-source Sycamore library.
In this talk, I will demonstrate analytics queries over real-world reports from the National Transportation Safety Board (NTSB). I will walk through, end to end, how the system ingests and indexes its data using Sycamore and OpenSearch, and how it plans and executes queries to achieve much better accuracy than RAG approaches. Also, given the current limitations of LLMs, we argue that an analytics system must be verifiable to be practical. To that end, we show how Aryn’s user interface provides explainability through lineage and execution traces to help build trust.”
Mehul Shah
Aryn Inc